108 research outputs found

    Unsupervised reduction of random noise in complex data by a row-specific, sorted principal component-guided method

    Background: Large biological data sets, such as expression profiles, benefit from reduction of random noise. Principal component (PC) analysis has been used for this purpose, but it tends to remove small features as well as random noise. Results: We interpreted the PCs as a mere signal-rich coordinate system and sorted the squared PC-coordinates of each row in descending order. The sorted squared PC-coordinates were compared with the distribution of the ordered squared random noise, and PC-coordinates for insignificant contributions were treated as random noise and nullified. The processed data were transformed back to the initial coordinates as noise-reduced data. To increase the sensitivity of signal capture and reduce the effects of stochastic noise, this procedure was applied to multiple small subsets of rows randomly sampled from a large data set, and the results corresponding to each row of the data set from multiple subsets were averaged. We call this procedure Row-specific, Sorted PRincipal component-guided Noise Reduction (RSPR-NR). Robust performance of RSPR-NR, measured by noise reduction and retention of small features, was demonstrated using simulated data sets. Furthermore, when applied to an actual expression profile data set, RSPR-NR preferentially increased the correlations between genes that share the same Gene Ontology terms, strongly suggesting reduction of random noise in the data set. Conclusion: RSPR-NR is a robust random noise reduction method that retains small features well. It should be useful in improving the quality of large biological data sets.
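The row-wise step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the PC-coordinates of a row are already available, that the noise is Gaussian with a known standard deviation `sigma`, and that the "distribution of the ordered squared random noise" can be approximated by Monte Carlo simulation. The function names, the 0.95 quantile, and the rule of stopping at the first insignificant rank are all hypothetical choices. The back-transformation to the initial coordinates and the averaging over random row subsets are omitted.

```python
import random


def ordered_noise_thresholds(n_coords, sigma, quantile=0.95, n_sim=2000, seed=0):
    """Estimate, for each rank, a quantile of the squared Gaussian noise
    after sorting the squared values in descending order (assumed noise model)."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sim):
        squared = sorted((rng.gauss(0.0, sigma) ** 2 for _ in range(n_coords)),
                         reverse=True)
        sims.append(squared)
    idx = min(n_sim - 1, int(quantile * n_sim))
    thresholds = []
    for rank in range(n_coords):
        column = sorted(sim[rank] for sim in sims)
        thresholds.append(column[idx])
    return thresholds


def denoise_row(pc_coords, thresholds):
    """Keep the PC-coordinates whose squared value exceeds the noise threshold
    for its rank; nullify everything from the first insignificant rank onward
    (an assumed stopping rule)."""
    order = sorted(range(len(pc_coords)),
                   key=lambda i: pc_coords[i] ** 2, reverse=True)
    cleaned = [0.0] * len(pc_coords)
    for rank, i in enumerate(order):
        if pc_coords[i] ** 2 <= thresholds[rank]:
            break  # this and all smaller contributions are treated as noise
        cleaned[i] = pc_coords[i]
    return cleaned
```

For example, with `thresholds = ordered_noise_thresholds(4, 1.0)`, calling `denoise_row([10.0, 0.1, -0.2, 0.05], thresholds)` keeps only the first coordinate, because the remaining squared values are far below what unit-variance noise alone would produce.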

    Improving Interpretation of Cardiac Phenotypes and Enhancing Discovery With Expanded Knowledge in the Gene Ontology

    This work was funded through grants from the British Heart Foundation (BHF, SP/07/007/23671, RG/13/5/30112) and the National Institute for Health Research University College London Hospitals Biomedical Research Centre; The Zebrafish Model Organism Database: National Human Genome Research Institute (NHGRI, HG002659, HG004838, HG004834); The Rat Genome Database: National Heart, Lung, and Blood Institute on behalf of the NIH (HL64541); The Mouse Genome Database: NHGRI (HG003300); FlyBase: UK Medical Research Council (G1000968); and Gene Ontology Consortium: NIH NHGRI (U41 HG002273) to Drs Blake, Cherry, Lewis, Sternberg, and Thomas. Professor Riley received a BHF personal chair award (CH/11/1/28798). Professors Lambiase and Tinker received support from the BHF and the UK Medical Research Council. Professor Tinker received a National Institute for Health Research Biomedical Research Centre at Barts and BHF grant (RG/15/15/31742). Dr Roncaglia received EMBL-EBI Core funds.

    Dovetailing biology and chemistry: integrating the Gene Ontology with the ChEBI chemical ontology

    Background: The Gene Ontology (GO) facilitates the description of the action of gene products in a biological context. Many GO terms refer to chemical entities that participate in biological processes. To facilitate accurate and consistent systems-wide biological representation, it is necessary to integrate the chemical view of these entities with the biological view of GO functions and processes. We describe a collaborative effort between the GO and the Chemical Entities of Biological Interest (ChEBI) ontology developers to ensure that the representation of chemicals in the GO is both internally consistent and in alignment with the chemical expertise captured in ChEBI. Results: We have examined and integrated the ChEBI structural hierarchy into the GO resource through computationally-assisted manual curation of both GO and ChEBI. Our work has resulted in the creation of computable definitions of GO terms that contain fully defined semantic relationships to corresponding chemical terms in ChEBI. Conclusions: The set of logical definitions using both the GO and ChEBI has already been used to automate aspects of GO development and has the potential to allow the integration of data across the domains of biology and chemistry. These logical definitions are available as an extended version of the ontology from http://purl.obolibrary.org/obo/go/extensions/go-plus.ow

    Local coexpression domains in the genome of rice show no microsynteny with Arabidopsis domains

    Chromosomal coexpression domains are found in a number of different genomes under various developmental conditions. The size of these domains and the number of genes they contain vary. Here, we define local coexpression domains as adjacent genes where all possible pair-wise correlations of expression data are higher than 0.7. In rice, such local coexpression domains consist predominantly of two genes, with a maximum of four, and make up ∼5% of the genomic neighboring genes, when examining different expression platforms from the public domain. The genes in local coexpression domains do not fall in the same ontology category significantly more than neighboring genes that are not coexpressed. Duplication, orientation or the distance between the genes does not solely explain coexpression. Coexpression is therefore thought to be regulated at the level of chromatin structure. The characteristics of the local coexpression domains in rice are strikingly similar to such domains in the Arabidopsis genome. Yet, no microsynteny between local coexpression domains in Arabidopsis and rice could be identified. Although the rice genome is not yet as extensively annotated as the Arabidopsis genome, the lack of conservation of local coexpression domains may indicate that such domains have not played a major role in the evolution of genome structure or in genome conservation.
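The definition of a local coexpression domain given in this abstract (adjacent genes whose pair-wise expression correlations all exceed 0.7) can be sketched as follows. This is an illustrative reading, not the authors' pipeline. It assumes Pearson correlation, expression profiles supplied in genomic order, and a greedy left-to-right scan that reports non-overlapping runs. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / sqrt(var_x * var_y)


def local_coexpression_domains(profiles, threshold=0.7):
    """Return (first_index, last_index) pairs for runs of adjacent genes in
    which every pair-wise correlation is above the threshold."""
    domains = []
    start = 0
    while start < len(profiles) - 1:
        end = start
        # Extend the run while the next gene correlates with every member.
        while end + 1 < len(profiles) and all(
            pearson(profiles[k], profiles[end + 1]) > threshold
            for k in range(start, end + 1)
        ):
            end += 1
        if end > start:
            domains.append((start, end))
            start = end + 1
        else:
            start += 1
    return domains
```

Checking the new gene against every current member, rather than only its immediate neighbor, is what enforces the "all possible pair-wise correlations" condition for domains larger than two genes.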

    Annotation of gene product function from high-throughput studies using the Gene Ontology

    High-throughput studies constitute an essential and valued source of information for researchers. However, high-throughput experimental workflows are often complex, with multiple data sets that may contain large numbers of false positives. The representation of high-throughput data in the Gene Ontology (GO) therefore presents a challenging annotation problem, when the overarching goal of GO curation is to provide the most precise view of a gene's role in biology. To address this, representatives from annotation teams within the GO Consortium reviewed high-throughput data annotation practices. We present an annotation framework for high-throughput studies that will facilitate good standards in GO curation and, through the use of new high-throughput evidence codes, increase the visibility of these annotations to the research community

    Insights into corn genes derived from large-scale cDNA sequencing

    We present a large portion of the transcriptome of Zea mays, including ESTs representing 484,032 cDNA clones from 53 libraries and 36,565 fully sequenced cDNA clones, out of which 31,552 clones are non-redundant. These and other previously sequenced transcripts have been aligned with available genome sequences and have provided new insights into the characteristics of gene structures and promoters within this major crop species. We found that although the average number of introns per gene is about the same in corn and Arabidopsis, corn genes have more alternatively spliced isoforms. Examination of the nucleotide composition of coding regions reveals that corn genes, as well as genes of other Poaceae (Grass family), can be divided into two classes according to the GC content at the third position in the amino acid encoding codons. Many of the transcripts that have lower GC content at the third position have dicot homologs but the high GC content transcripts tend to be more specific to the grasses. The high GC content class is also enriched with intronless genes. Together this suggests that an identifiable class of genes in plants is associated with the Poaceae divergence. Furthermore, because many of these genes appear to be derived from ancestral genes that do not contain introns, this evolutionary divergence may be the result of horizontal gene transfer from species not only with different codon usage but possibly that did not have introns, perhaps outside of the plant kingdom. By comparing the cDNAs described herein with the non-redundant set of corn mRNAs in GenBank, we estimate that there are about 50,000 different protein coding genes in Zea. All of the sequence data from this study have been submitted to DDBJ/GenBank/EMBL under accession numbers EU940701–EU977132 (FLI cDNA) and FK944382-FL482108 (EST)
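The quantity used in this abstract to divide genes into two classes, the GC content at the third codon position, can be computed with a short sketch like the one below. It assumes an in-frame coding sequence given as a plain string, counts the stop codon if present, and ignores any trailing partial codon. The function name is hypothetical, and the cutoff separating the two classes is not stated in the abstract, so none is applied here.

```python
def gc3_content(cds):
    """Fraction of codons whose third base is G or C.

    Assumes `cds` starts in frame; a trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # drop an incomplete final codon
    third_bases = cds[2:usable:3]      # bases at positions 3, 6, 9, ...
    if not third_bases:
        return 0.0
    return sum(base in "GC" for base in third_bases) / len(third_bases)
```

For instance, `gc3_content("ATGGCCAAGTAA")` examines the codons ATG, GCC, AAG and TAA, whose third bases are G, C, G and A, giving 0.75.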

    Crowdsourcing biocuration: The Community Assessment of Community Annotation with Ontologies (CACAO).

    Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills

    Genome structure of cotton revealed by a genome-wide SSR genetic map constructed from a BC1 population between Gossypium hirsutum and G. barbadense

    Background: Cotton, with a large genome, is an important crop throughout the world. A high-density genetic linkage map is the prerequisite for cotton genetics and breeding. A genetic map based on simple polymerase chain reaction markers will be efficient for marker-assisted breeding in cotton, and markers from transcribed sequences have more chance to target genes related to traits. To construct a genome-wide, functional marker-based genetic linkage map in cotton, we isolated and mapped expressed sequence tag-simple sequence repeats (EST-SSRs) from cotton ESTs derived from the A1, D5, (AD)1, and (AD)2 genome. Results: A total of 3177 new EST-SSRs developed in our laboratory and other newly released SSRs were used to enrich our interspecific BC1 genetic linkage map. A total of 547 loci and 911 loci were obtained from our EST-SSRs and the newly released SSRs, respectively. The 1458 loci together with our previously published data were used to construct an updated genetic linkage map. The final map included 2316 loci on the 26 cotton chromosomes, 4418.9 cM in total length and 1.91 cM in average distance between adjacent markers. To our knowledge, this map is one of the three most dense linkage maps in cotton. Twenty-one segregation distortion regions (SDRs) were found in this map; three segregation distorted chromosomes, Chr02, Chr16, and Chr18, were identified with 99.9% of distorted markers segregating toward the heterozygous allele. Functional analysis of SSR sequences showed that 1633 loci of this map (70.6%) were transcribed loci and 1332 loci (57.5%) were translated loci. Conclusions: This map lays groundwork for further genetic analyses of important quantitative traits, marker-assisted selection, and genome organization architecture in cotton as well as for comparative genomics between cotton and other species. The segregation distorted chromosomes can be a guide to identify segregation distortion loci in cotton. The annotation of SSR sequences identified frequent and rare gene ontology items on each chromosome, which is helpful to discover functions of cotton chromosomes.

    Expressed sequence tag analysis of khat (Catha edulis) provides a putative molecular biochemical basis for the biosynthesis of phenylpropylamino alkaloids

    Khat (Catha edulis Forsk.) is a flowering perennial shrub cultivated for its neurostimulant properties resulting mainly from the occurrence of (S)-cathinone in young leaves. The biosynthesis of (S)-cathinone and the related phenylpropylamino alkaloids (1S,2S)-cathine and (1R,2S)-norephedrine is not well characterized in plants. We prepared a cDNA library from young khat leaves and sequenced 4,896 random clones, generating an expressed sequence tag (EST) library of 3,293 unigenes. Putative functions were assigned to > 98% of the ESTs, providing a key resource for gene discovery. Candidates potentially involved at various stages of phenylpropylamino alkaloid biosynthesis from L-phenylalanine to (1S,2S)-cathine were identified